feat(gemini): add GeminiDenseEmbedder text embedding provider#751

Open

chethanuk wants to merge 8 commits into volcengine:main from chethanuk:feat/gemini-landing

Conversation

@chethanuk
Contributor

@chethanuk commented Mar 18, 2026

Description

Phase 1 of #566: Adds GeminiDenseEmbedder — a production-ready, opt-in text-only dense embedding provider backed by Google's google-genai SDK. Volcengine remains the default; Gemini is enabled via provider: "gemini" in ov.conf. No existing behaviour changes.

This PR implements Phase 1 (text-only). Phase 2 (multimodal embedding) is tracked in #566 — the issue remains open.
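For reference, an opt-in configuration might look like the sketch below. The `provider`, `api_key`, `task_type`, and `dimension` keys are taken from the architecture diagram in this PR; the top-level `embedding` key and the model name are assumptions, and the exact ov.conf syntax may differ:

```yaml
# Hypothetical ov.conf fragment — field nesting and model name are assumed.
embedding:
  provider: "gemini"                 # opt-in; volcengine remains the default
  api_key: "..."                     # Google API key
  model: "gemini-embedding-001"      # assumed model identifier
  task_type: "RETRIEVAL_DOCUMENT"    # auto-uppercased during normalization
  dimension: 1536                    # MRL truncation target, 1-3072
```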

Phased Implementation

| Phase | Scope | Status |
| --- | --- | --- |
| Phase 1 (this PR) | Text-only dense embedding: 8 task types, MRL 1-3072, SDK retry, async batching, non-symmetric routing via the #702 pattern | |
| Phase 2 (future) | Multimodal embedding: image/audio/video via Parts API, cross-modal retrieval, align with #718 approach | Planned |

Key capabilities

  • Text-only — supports_multimodal = False; multimodal is Phase 2
  • Non-symmetric routing — get_query_embedder() / get_document_embedder() route query_param/document_param to Gemini task types via the factory (follows the #702 pattern)
  • Per-call task_type + title — override at call time; all 8 Gemini task types supported
  • Case-insensitive config — task_type, query_param, document_param auto-uppercased in config normalization
  • SDK-native retry — HttpRetryOptions(attempts=3, exp_base=2, max_delay=30s) with an import guard for SDK < 0.8
  • Async concurrent batching — async_embed_batch() dispatches 100-text chunks in parallel via anyio
  • Empty-text guard — returns zero-vectors with warning log; preserves index alignment
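The semaphore-bounded concurrent batching described above can be sketched with stdlib asyncio (the PR itself uses anyio; function and parameter names here are illustrative, not the PR's actual API):

```python
import asyncio

async def async_embed_batch(chunks, embed_chunk, max_concurrency=8):
    """Embed chunks concurrently; a semaphore bounds in-flight requests,
    and gather() preserves the original chunk order in the result."""
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(chunk):
        async with sem:
            return await embed_chunk(chunk)

    return await asyncio.gather(*(worker(c) for c in chunks))
```

The same shape maps onto anyio with `anyio.Semaphore` and a task group; the key point is that per-chunk API calls overlap while concurrency stays bounded.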
```mermaid
flowchart LR
    subgraph Config["ov.conf / EmbeddingModelConfig"]
        CFG["provider: gemini\napi_key: sk-…\ntask_type: RETRIEVAL_DOCUMENT\ndimension: 1536"]
    end

    subgraph Embedder["GeminiDenseEmbedder(DenseEmbedderBase)"]
        direction TB
        BC["_build_config(task_type?, title?)"]
        E["embed(text, task_type?, title?)"]
        EB["embed_batch(texts, titles?)"]
        AEB["async_embed_batch(texts)"]
        E --> BC
        EB --> BC
        AEB --> BC
    end

    subgraph SDK["google-genai SDK"]
        direction TB
        HTTP["HttpOptions\n(retry: 3× exp backoff)"]
        SYNC["models.embed_content()"]
        ASYNC["aio.models.embed_content()\n(semaphore-bounded)"]
        HTTP --> SYNC
        HTTP --> ASYNC
    end

    Config --> Embedder
    BC --> SYNC
    AEB --> ASYNC
```
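The _build_config() merge shown in the diagram can be sketched as follows. This is a hedged, standalone illustration: the real method lives on GeminiDenseEmbedder and returns an SDK config object, while this sketch returns a plain dict; the key names mirror google-genai's EmbedContentConfig fields as described in the PR.

```python
from typing import Optional

def build_config(
    default_task_type: str,
    default_dimension: int,
    task_type: Optional[str] = None,
    title: Optional[str] = None,
) -> dict:
    """Merge per-call overrides with instance defaults.

    A per-call task_type wins over the instance default; title is only
    included when supplied (Gemini accepts it for document-style tasks).
    """
    cfg = {
        "task_type": (task_type or default_task_type).upper(),
        "output_dimensionality": default_dimension,
    }
    if title is not None:
        cfg["title"] = title
    return cfg
```

embed(), embed_batch(), and async_embed_batch() all funnel through this merge, which is why per-call overrides behave identically across the sync and async paths.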

Related Issue

Partial implementation of #566 (Phase 1: text-only dense embedding; Phase 2: multimodal — issue stays open)

Type of Change

  • Bug fix
  • New feature (non-breaking, opt-in via config)
  • Breaking change
  • Documentation update
  • Refactoring
  • Performance improvement
  • Test update

Changes Made

  • openviking/models/embedder/gemini_embedders.py — GeminiDenseEmbedder(DenseEmbedderBase): _build_config(), per-call task_type/title, SDK retry, __repr__, empty-text guard with warning log
  • openviking_cli/utils/config/embedding_config.py — registers "gemini" provider; validates + auto-uppercases task_type/query_param/document_param; factory routes context for non-symmetric mode
  • openviking/models/embedder/__init__.py — exports GeminiDenseEmbedder; annotation: "text-only; multimodal planned"
  • examples/ov.conf.example — Gemini config snippet
  • pyproject.toml — google-genai>=0.8 dep; optional gemini-async extra for anyio
  • tests/unit/test_gemini_embedder.py — 41 unit tests (mock-only, no API key needed)
  • tests/unit/test_embedding_config_gemini.py — 18 config validation tests (incl. case-insensitive task_type)
  • tests/integration/ — 49 integration tests (auto-skip without GOOGLE_API_KEY)

Testing

| Suite | Tests | Live key required |
| --- | --- | --- |
| test_gemini_embedder.py | 41 | No |
| test_embedding_config_gemini.py | 18 | No |
| test_gemini_e2e.py | 7 | Yes |
| test_gemini_embedding_it.py | 22 | Yes |
| test_gemini_openviking_it.py | 20 | Yes |
108 passed, 0 skipped  ← with GOOGLE_API_KEY
 59 passed, 49 skipped ← without GOOGLE_API_KEY (CI)

Checklist

  • Code follows project coding style (ruff + black formatted)
  • Self-review completed
  • Tests added and passing
  • Tested on Linux and macOS

- Adds GeminiDenseEmbedder backed by google-genai>=0.8.0 SDK
- Supports all 8 task types, MRL dimensions 1-3072, async batch via anyio
- Wires gemini provider into EmbeddingModelConfig factory + context routing
- Adds unit tests (mocked, no API key) and IT tests (auto-skip without GOOGLE_API_KEY)
- Updates pyproject.toml with google-genai, anyio, pytest-asyncio, pytest-cov deps
…t from IT test

- GeminiDenseEmbedder.supports_multimodal = False (text-only provider)
- Remove ModalContent/Vectorize import from test_gemini_e2e.py (doesn't exist in gem-v1)
- Remove TestGeminiE2EMultimodalEmbedding class (multimodal out of scope for text-only slice)
- embed_batch() fallback: pass is_query through to self.embed()
- Remove redundant _l2_normalize() calls in embed/embed_batch/async_embed_batch;
  Gemini API already returns L2-normalized vectors, truncate_and_normalize is sufficient
- Remove now-dead _l2_normalize() function and unused math import
- embedding_config.py: fix typo "Secretfor" → "Secret for" in sk field description
- conftest.py: narrow broad except Exception to (ImportError, ModuleNotFoundError, AttributeError)
- test_gemini_embedding_it.py: construct fake API key dynamically to avoid hardcoded secret pattern
- test_gemini_openviking_it.py: add dimension assertion in test_dimension_variant_add_search;
  fix tautology assertion (>= 0 → > 0) in test_session_search_smoke
…dd pytest-xdist

- embed()/embed_batch(): return zero vector immediately for empty/whitespace-only
  text, avoiding non-descriptive Gemini API errors
- embed_batch(): filter empty strings pre-API call; reassemble results in original
  order so callers receive a zero vector at each empty position
- Remove _l2_normalize() function and all call sites (embed, embed_batch,
  async_embed_batch): Gemini API already returns L2-normalized vectors;
  truncate_and_normalize() is sufficient
- Remove unused `import math` (was only used by _l2_normalize)
- Add pytest-xdist>=3.0 to [dependency-groups].dev for parallel test runs
- Add 2 unit tests: test_embed_empty_string_returns_zero_vector,
  test_embed_batch_skips_empty_strings (50 unit tests total)
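The filter-and-reassemble behavior described in this commit can be sketched as a standalone helper (illustrative names; the real logic is inlined in embed_batch()):

```python
from typing import Callable, List

def embed_batch_with_empty_guard(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    dimension: int,
) -> List[List[float]]:
    """Filter empty/whitespace-only texts before the API call, then
    reinsert zero vectors at their original positions so index
    alignment with the input list is preserved."""
    non_empty = [(i, t) for i, t in enumerate(texts) if t and t.strip()]
    vectors = embed_fn([t for _, t in non_empty]) if non_empty else []
    # Default every slot to a zero vector, then overwrite the non-empty ones.
    out = [[0.0] * dimension for _ in texts]
    for (i, _), vec in zip(non_empty, vectors):
        out[i] = vec
    return out
```

Callers indexing results by input position see a zero vector at each empty slot instead of an opaque API error.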
…ry, repr

- Add _build_config() to merge per-call task_type/title with instance defaults
- embed() gains keyword-only task_type= and title= overrides
- embed_batch() gains keyword-only task_type= and titles=; falls back to
  per-item embed() calls when titles is provided (per-document metadata)
- async_embed_batch() switches to _build_config() (no new args)
- Client constructed with HttpOptions(retry_options=HttpRetryOptions(
  attempts=3, initial_delay=1.0, max_delay=30.0, exp_base=2.0))
  when SDK >= 0.8.0; falls back to plain Client on ImportError
- Add __repr__: GeminiDenseEmbedder(model=..., dim=..., task_type=...)
- Remove now-dead self._embed_config from __init__
- 8 new unit tests in TestBuildConfig; fix test_init_stores_fields
  assertion to allow http_options kwarg
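The client construction with an import guard described above can be sketched like this. The retry parameter values come from the commit message; treat the exact google-genai type surface as an assumption (the guard exists precisely because older SDKs lack these types):

```python
def build_http_options():
    """Return HttpOptions carrying retry config on google-genai >= 0.8,
    or None so the caller can fall back to a plain Client."""
    try:
        from google.genai.types import HttpOptions, HttpRetryOptions
    except ImportError:
        return None  # SDK < 0.8: no SDK-level retries available
    return HttpOptions(
        retry_options=HttpRetryOptions(
            attempts=3, initial_delay=1.0, max_delay=30.0, exp_base=2.0
        )
    )
```

The embedder would pass the result to `genai.Client(http_options=...)` when it is not None, keeping retry behavior inside the SDK rather than hand-rolled.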
@chethanuk
Contributor Author

@qin-ctx Please review and merge this


@ZaynJarvis
Collaborator

#702 is merged for query and document embedding support.

get_query_embedder() is not the expected usage.

btw check if you can improve based on #718


@qin-ptr left a comment


Review Summary

This PR adds Google Gemini as a text embedding provider with excellent test coverage (107 tests) and solid error handling. However, there are three blocking issues that need to be addressed:

  1. Design mismatch: Issue #566 explicitly requests multimodal retrieval support, but this PR only implements text-only embedding (supports_multimodal = False). The PR description should clarify this is a phased implementation with a roadmap for multimodal support, or change the issue reference to indicate partial completion.

  2. Documentation inconsistency: Code comment claims "multimodal" but implementation is text-only.

  3. CI failure: lint job failed due to formatting issues in test_gemini_embedder.py.

Once these are resolved, the PR will be ready to merge. The implementation quality is strong, with comprehensive testing, graceful error handling, and proper SDK retry integration.

🤖 I am a bot owned by @qin-ctx.

- OpenAI: Dense only
- Volcengine: Dense, Sparse, Hybrid
- Jina AI: Dense only
- Google Gemini: Dense only (multimodal)

[Bug] (blocking)

Comment says "Google Gemini: Dense only (multimodal)" but the actual implementation has GeminiDenseEmbedder.supports_multimodal = False. This is contradictory.

Should either:

  • Change to "Google Gemini: Dense only (text-only; multimodal planned)"
  • Or remove the "(multimodal)" annotation entirely until multimodal support is actually implemented

# text-embedding-004: 768 fixed-dim legacy model, does not support MRL truncation
# Future gemini-embedding-*: default 3072 via _default_dimension() fallback
# Future text-embedding-*: default 768 via _default_dimension() prefix rule
supports_multimodal: bool = False # text-only; multimodal planned separately

[Design] (blocking)

This PR claims to close issue #566, which explicitly requests "first-class Gemini Embedding 2 support for multimodal retrieval". The issue emphasizes:

  • "it should not reduce everything to plain text before embedding"
  • "multimodal inputs handled as multimodal, not flattened by default"

However, this implementation sets supports_multimodal = False and only handles text. This is a fundamental mismatch between the issue requirement and the PR implementation.

Recommendation: Either:

  1. Update the PR description to clarify this is phase 1 (text-only) with a roadmap for phase 2 (multimodal), and change the issue reference to "Partial implementation of [Feature]: Add first-class Gemini Embedding 2 support for multimodal retrieval #566", keeping the issue open
  2. Or implement multimodal support in this PR as originally requested by the issue

The current "Closes: #566" statement is misleading and will cause users to expect multimodal functionality that doesn't exist.

@@ -0,0 +1,479 @@
# Copyright (c) 2026 Beijing Volcano Engine Technology Co., Ltd.

[Bug] (blocking)

CI lint job failed because this file needs formatting:

Would reformat: tests/unit/test_gemini_embedder.py

Please run black tests/unit/test_gemini_embedder.py and commit the formatted version.

if backend is not None and provider is None:
data["provider"] = backend
for key in ("query_param", "document_param"):
for key in ("input_type",):

[Suggestion] (non-blocking)

The for key in ("input_type",): loop normalizes input_type to lowercase, but the new Gemini provider uses task_type instead of input_type. Consider also normalizing task_type here, or document that task_type must be uppercase (which is currently enforced in the Gemini validator at line 176).

task_type: Optional[str] = None,
title: Optional[str] = None,
) -> EmbedResult:
if not text or not text.strip():

[Suggestion] (non-blocking)

Returning a zero vector for empty text is a reasonable fallback, but it happens silently. Consider logging a warning so callers know their empty input was handled specially:

if not text or not text.strip():
    logger.warning("Empty text passed to embed(), returning zero vector")
    return EmbedResult(dense_vector=[0.0] * self._dimension)

This helps with debugging when unexpected empty strings appear in production.

…fig normalization, empty-text warning

- Fix "(multimodal)" annotation to "(text-only; multimodal planned)" to match supports_multimodal=False (B1)
- Run black formatter on test_gemini_embedder.py (B3)
- Add auto-uppercase normalization for task_type/query_param/document_param in sync_provider_backend so lowercase config values pass validation (NB1)
- Add logger.warning on empty text in embed() for production debuggability (NB2)
- Add test_gemini_task_type_case_insensitive test (NB1)
- Fix ruff import sorting in __init__.py
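The auto-uppercase normalization added in this commit can be sketched as a standalone helper (illustrative; the real logic lives in sync_provider_backend in embedding_config.py):

```python
def normalize_gemini_keys(data: dict) -> dict:
    """Uppercase the Gemini routing keys in-place so lowercase config
    values (e.g. task_type: "retrieval_document") pass validation."""
    for key in ("task_type", "query_param", "document_param"):
        value = data.get(key)
        if isinstance(value, str):
            data[key] = value.upper()
    return data
```

Non-string and absent keys are left untouched, so the helper is safe to run on every provider's config dict.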
@chethanuk
Contributor Author

Thanks for the thorough review! All issues addressed in commit bcfd612:

@qin-ptr — blocking fixes

@qin-ptr — non-blocking (adopted both)

  • NB1 task_type normalization: Added auto-uppercase in sync_provider_backend for task_type, query_param, document_param. Users can now write task_type: "retrieval_document" (any case) in config. Added test test_gemini_task_type_case_insensitive.
  • NB2 Empty text warning: Added logger.warning("Empty text passed to embed(), returning zero vector")

  • @ZaynJarvis — #702 pattern + #718

Verification

ruff check: All checks passed!
pytest: 59 passed (41 embedder + 18 config, incl. 1 new normalization test)
black --check: 1 file would be left unchanged

chethanuk force-pushed the feat/gemini-landing branch from 5ac0ebc to 1586f65 on March 19, 2026 at 10:12
CI uses ruff format (not black); re-ran ruff format to match.
chethanuk force-pushed the feat/gemini-landing branch from 1586f65 to 893a1da on March 19, 2026 at 13:00
@chethanuk
Contributor Author

chethanuk commented Mar 19, 2026

@qin-ctx Can we merge? Main keeps changing, and tests are now working.


3 participants